GBE 



Dynamics and Adaptive Benefits of Protein Domain 
Emergence and Arrangements during Plant Genome 
Evolution 

Anna R. Kersting, Erich Bornberg-Bauer, Andrew D. Moore*, and Sonja Grath* 

Evolutionary Bioinformatics Group, Institute for Evolution and Biodiversity, University of Muenster (WWU), Germany 
*Corresponding author: E-mail: radmoore@uni-muenster.de; s.grath@uni-muenster.de.
Accepted: 9 January 2012 

Abstract 

Plant genomes are generally very large, mostly paleopolyploid, and have numerous gene duplicates and complex genomic
features such as repeats and transposable elements. Many of these features have been hypothesized to enable plants, which 
cannot easily escape environmental challenges, to rapidly adapt. Another mechanism, which has recently been well 
described as a major facilitator of rapid adaptation in bacteria, animals, and fungi but not yet for plants, is modular 
rearrangement of protein-coding genes. Due to the high precision of profile-based methods, rearrangements can be well 
captured at the protein level by characterizing the emergence, loss, and rearrangements of protein domains, their structural, 
functional, and evolutionary building blocks. Here, we study the dynamics of domain rearrangements and explore their 
adaptive benefit in 27 plant and 3 algal genomes. We use a phylogenomic approach by which we can explain the formation 
of 88% of all arrangements by single-step events, such as fusion, fission, and terminal loss of domains. We find many 
domains are lost along every lineage, but at least 500 domains are novel, that is, they are unique to green plants and 
emerged more or less recently. These novel domains duplicate and rearrange more readily within their genomes than ancient 
domains and are overproportionally involved in stress response and developmental innovations. Novel domains more often 
affect regulatory proteins and show a higher degree of structural disorder than ancient domains. Whereas a relatively large 
and well-conserved core set of single-domain proteins exists, long multi-domain arrangements tend to be species-specific. 
We find that duplicated genes are more often involved in rearrangements. Although fission events typically impact metabolic 
proteins, fusion events often create new signaling proteins essential for environmental sensing. Taken together, the high 
volatility of single domains and complex arrangements in plant genomes demonstrate the importance of modularity for 
environmental adaptability of plants. 

Introduction

The wealth of genomic data has governed a number of in- 
sightful studies on genome evolution. To date, most studies 
have concentrated on gene duplications, gene family expan- 
sion or reduction, selective sweeps or signals of selection us- 
ing site-based statistics. An alternative approach to studying 
genome evolution utilizes the modular nature of proteins. 
Most proteins are composed of one or many protein do- 
mains, which are the units of protein structure, function, 
and evolution (Soding and Lupas 2003; Moore et al. 
2008). The majority of proteins can be described using 
a small set of domains, which, despite the ever-increasing 
amount of available sequence data, grows at only moderate
speed. In contrast, the number of domain arrangements,
that is, the combination of these domains in proteins, con- 
tinues to rapidly grow (Levitt 2009; Yang et al. 2009). The 
study of domain rearrangements across large phyla has pro- 
vided a detailed understanding of modular protein evolution 
(Bjorklund et al. 2005; Ekman et al. 2007; Fong et al. 2007; 
Wang and Caetano-Anolles 2009; Yang et al. 2009) and has 
demonstrated that domain rearrangements, paired with 
the occasional formation of novel domains (Moore and 
Bornberg-Bauer 2012), create an enormous degree of pro- 
tein diversity (Apic et al. 2001; Levitt 2009; Yang et al. 
2009). The majority of eukaryotic proteins have more than 
one domain (Apic et al. 2001; Ekman et al. 2005; Yang et al.
2009), and while many domains are found in few arrangements,
only few domains are versatile and form a wide array
of different arrangements (Weiner et al. 2008; Cohen-Gihon
et al. 2011). Rearrangement events at the protein level are
easy to detect, and the key mechanisms are thought to be
fusion, fission, and terminal deletion (Bjorklund et al.
2005; Weiner et al. 2006). These events are likely fueled
by a series of underlying genetic events such as nonallelic
homologous recombination, exon-shuffling, nonhomologous
end joining or transposition (Babushok et al. 2007; Buljan 
et al. 2010). However, with few exceptions (e.g., Oshima 
et al. 2010), traces of the genetic mechanisms of rearrange- 
ment swiftly decay. Buljan et al. (2010) explored the genetic 
events that facilitate domain gain events to existing arrange- 
ments. Their results provide support to the notion that 
domains are typically added at either terminus. The key mech- 
anism for such domain gain events involves the joining of 
exons between genes or terminal exon extension. The study 
of domain content evolution in eukaryotes has illustrated that 
domain loss and gain are frequent events (Moore and 
Bornberg-Bauer 2011; Zmasek and Godzik 2011). Whereas
lost domains tend to be of catalytic nature, gained domains 
tend to be regulatory. Despite the diverse studies that have 
explored modular evolution across many species as well as 
in restricted clades, to date no study has quantitatively 
addressed the topic of modularity in a set of plant species. 
However, modular evolution may be of particular importance 
for plants, as they face a challenge that many other species do 
not — they cannot easily evade environmental changes 
because of their sessile nature. In particular the fusion of 
genes, and consequently of domain arrangements, allows 
for "jumps" in protein evolution and may govern truly novel 
genetic phenotypes. Hence such fusion proteins may exhibit 
great adaptive potential. Indeed, recent findings suggest that 
chimeric genes formed by gene fusion can be found in 
regions of selective sweeps (Rogers and Hartl 2012).

Fusion events have been shown to be associated with 
regulatory proteins such as the metazoan bHLH transcrip- 
tion factors (Amoutzias et al. 2005) or the MIKC-type 
MADS-box transcription factor proteins in plants (Veron 
et al. 2007; Shan et al. 2009). Innovation of transcription 
factor families is often the result of duplication events, 
which may occur in chromosomal regions with high recom- 
bination rates. Furthermore, it has been illustrated that du- 
plication events in combination with high recombination 
rates are strong forces in genome evolution (Lang et al.
2010).

Duplications have been more frequently described for 
plants than elsewhere and plant genome evolution is special 
in several aspects. First, plant genomes are repeat-rich and 
transposable elements have a particularly prominent role in 
creating retrocopies of genes, for example, in monocots 
(Bennetzen 2005; Baucom et al. 2009; Baucom, Estill et al. 
2009). Second, several whole-genome duplication (WGD)
events have created many large genomes with various degrees
of ploidy within a relatively short period of time.
35% of all vascular plants are recent polyploids (Wood 
et al. 2009). Moreover, angiosperms have undergone up to 
four rounds of WGD in roughly 320 Myr, with one WGD common
to all seed plants 319 Ma and one WGD common to all
angiosperms 192 Ma (van de Peer et al. 2009; Jiao et al.
2011). Although polyploidy events pose a genomic challenge
to their host and most polyploidy events are considered
a "dead end" for evolution (Mayrose et al. 2011), it has been
suggested that polyploidy, be it the result of autopolyploidy or 
allopolyploidy, may occasionally provide a starting point for 
evolutionary innovation (Freeling et al. 2006; van de Peer 
et al. 2009). The benefit of an increased amount of genetic 
material might be to allow for swift adaptation to extreme 
environments (van de Peer et al. 2009). For example, the in- 
creased heterozygosity resulting from polyploidy impacts the 
wiring of signaling cascades and can facilitate strong variation 
in gene expression (Osborn et al. 2003). Numerous studies 
have also explicitly explored the impact of WGD in plants 
at the genomic level, for example, by exploring duplicate re- 
tention rates (Hanada et al. 2008; Tang et al. 2008; Zheng 
et al. 2009), gene dosage effects (Freeling et al. 2006; 
Misook et al. 2007; Bekaert et al. 2011), or recombination 
rates (Akhunov et al. 2003). WGDs may enhance the potential 
for diversification and speciation (van de Peer et al. 2009), yet 
the details remain poorly understood. 

As genomic stability is largely influenced by genome size 
and repeat content (Bennetzen 2005), one might speculate 
that plants have high rates of recombination and hence ex- 
hibit a high number of domain rearrangements. Indeed, com- 
parative studies have illustrated that angiosperms exhibit 
higher recombination rates than vertebrates (Kejnovsky 
et al. 2009). However, to date, no study has explored the 
extent of modular protein evolution in plants. 

Given their large genome size, higher recombination rates, 
and the inability to flee upon environmental challenges, it 
seems likely that plants may utilize their abundant genomic 
material to facilitate rapid evolutionary innovation. Conse- 
quently, the benefits of modular domain rearrangements 
might be particularly pronounced, since the ability of modular 
evolution to swiftly implement changes to the protein reper- 
toire may be a key process in both exploiting existing and cre- 
ating functionalities. So far, all studies on the evolutionary 
dynamics and the adaptive potential of domain rearrange- 
ments have been reported for bacteria (Enright and Ouzounis 
2001), metazoa (Ekman et al. 2007), or fungi (Cohen-Gihon 
et al. 2011), but none for plants.

In this report, we explore the nature of modular evolution 
in 29 green plant species (Viridiplantae) with taxa ranging 
from green algae to liliopsida and eudicotyledons. Our aim 
is to understand the evolutionary dynamics by studying the 
frequency of individual modular events such as fusion, fission, 
or terminal loss. We apply a maximum parsimony-based 
approach to reconstruct events, placing this study into a phylogenomic
framework, and quantitatively address the role of
domain emergence and domain rearrangements. Further- 
more, we explore the speed with which new domains, 
and their arrangements, are gained and lost; how many of 
these events are clade or species-specific and whether event 
"hotspots" can be found amongst the phylogenies of the 
considered species. Finally, we employ several functional 
analyses based on the Gene Ontology (GO) classification 
(Ashburner and Lewis 2002) to shed light on the potential 
adaptive benefits of domain emergence and rearrangements 
during plant genome evolution. 



Materials and Methods 

Proteomes and Domain Annotation 

Comparative analyses of protein domains and their arrange- 
ments were performed on the following 29 plant genomes: 
Arabidopsis thaliana v9.0 (The Arabidopsis Initiative 2000);
Arabidopsis lyrata v1.0 (Hu et al. 2011); Carica papaya v1.0
(Ming et al. 2008); Citrus sinensis v1.0 (Sweet Orange Genome
Project 2010); Citrus clementina v0.9 (Haploid Clementine
Genome International Citrus Genome Consortium 2011);
Eucalyptus grandis v1.0 (Eucalyptus grandis Genome Project
2010); Mimulus guttatus v1.1; Aquilegia coerulea; Theobroma
cacao v1.0 (Argout et al. 2011); Glycine max v1.0 (Schmutz
et al. 2010); Medicago truncatula v3.0 (Young et al. 2005);
Lotus japonica v1.0 (Young et al. 2005); Populus trichocarpa
v2.0 (Tuskan et al. 2006); Ricinus communis v1.0 (Chan et al.
2010); Manihot esculenta v1.1; Malus domestica (Velasco
et al. 2010); Prunus persica v1.0 (International Peach Genome
Initiative 2010); Cucumis sativus v1.0 (Huang et al. 2009); Vitis
vinifera v1.0 (Jaillon et al. 2007); Setaria italica v2.0 (Setaria
italica Genome Sequencing Project 2011); Zea mays v4a.53
(Schnable et al. 2009); Sorghum bicolor v1.4 (Dubchak
et al. 2009); Oryza sativa v6.1 (Go et al. 2002); Brachypodium
distachyon v1.0 (Vogel et al. 2010); Phoenix dactylifera v2.0
(Al-Dous et al. 2011); Selaginella moellendorffii v1.0 (Banks
et al. 2011); Physcomitrella patens v1.5 (Rensing et al.
2008); Chlamydomonas reinhardtii v4.0 (Merchant et al.
2007); Ostreococcus lucimarinus v2.0 (Palenik et al. 2007);
and Micromonas pusilla v3.0 (Worden et al. 2009).

We rooted the tree ~1,700 Ma by including Trichoplax adhaerens
v1.0 (Srivastava et al. 2008), Rhizopus oryzae (Ma
et al. 2009) and Drosophila melanogaster v5.11 (Adams 
et al. 2000). Phylogenetic relationships for all 32 species 
(29 plants and 3 outgroups) used for this study are given 
in supplementary figure 1 (Supplementary Material online). 
If several splice variants were present for one protein, we 
excluded all but the longest transcript. All proteomes were 
scanned for domains with the pfam_scan utility and 
HMMER3.0 (Eddy 2011) against the Pfam-A and Pfam-B 
models obtained from Pfam (v.24) (Finn et al. 2008). 



For the annotation of Pfam-A domains, we used the 
model-defined gathering threshold and query sequences were 
required to match at least 30% of the defining model (Buljan 
et al. 2010). Pfam-B domains were annotated using an E value
cutoff of 0.001 (Ekman et al. 2007). Pfam-A domains with clan 
membership were mapped to their clans and domains of type 
"repeat" or "motif" were collapsed into one large domain 
instance (Ekman et al. 2005; Forslund et al. 2007). 
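
The two post-processing rules for Pfam-A hits described above (the 30% model-coverage requirement and the collapsing of repeats and motifs) can be illustrated with a minimal Python sketch. The `Hit` record, its field names, and the way coverage is computed from model coordinates are assumptions made for illustration; they are not taken from the pfam_scan output format or from the authors' scripts. The sketch also assumes that only directly adjacent hits of the same repeat or motif are merged.

```python
from dataclasses import dataclass, replace

MIN_MODEL_COVERAGE = 0.30  # fraction of the model a hit must match (from the text)


@dataclass
class Hit:
    # Hypothetical record for one domain hit; field names are illustrative.
    domain: str      # Pfam family or clan identifier
    kind: str        # Pfam entry type, e.g. "domain", "family", "repeat", "motif"
    start: int       # start of the hit on the protein
    end: int         # end of the hit on the protein
    hmm_start: int   # first model position matched
    hmm_end: int     # last model position matched
    hmm_length: int  # total length of the model


def covers_model(hit):
    """True if the hit spans at least 30% of its defining model."""
    matched = hit.hmm_end - hit.hmm_start + 1
    return matched / hit.hmm_length >= MIN_MODEL_COVERAGE


def collapse_repeats(hits):
    """Merge runs of adjacent, identical repeat or motif hits into one instance."""
    collapsed = []
    for hit in sorted(hits, key=lambda h: h.start):
        prev = collapsed[-1] if collapsed else None
        same_run = (prev is not None and prev.domain == hit.domain
                    and hit.kind in ("repeat", "motif"))
        if same_run:
            prev.end = max(prev.end, hit.end)  # extend the merged instance
        else:
            collapsed.append(replace(hit))     # copy, so the input stays untouched
    return collapsed


def arrangement(hits):
    """Ordered domain identifiers of one protein after filtering and collapsing."""
    kept = [h for h in hits if covers_model(h)]
    return [h.domain for h in collapse_repeats(kept)]
```

Calling `arrangement` on the hits of a single protein returns its domain arrangement as an N- to C-terminal list of identifiers.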

Reconstruction of the Ancestral Domain State; Domain 
Gain, Loss, and Emergence 

We reconstructed ancestral domain contents using a maximum
parsimony approach as follows: the tree (see fig. 1B)
was traversed twice, first from leaves to root, then from root
to leaves. Domain presence or absence is determined by majority
rule. During the first traversal (leaves to root), the state of
domain d is set to present at a node n, if d is present in the 
majority of leaves of the subtree rooted in n (leaves of n). 
Similarly, d is set to absent at n, if d is absent in the majority 
of leaves of n. If there is no state majority for d in the child 
nodes of n (i.e., there is an identical proportion of presence 
and absence states for d in the leaves), the state of d at n is 
set to unknown. As traversal continues toward the root, d is 
set to present (absent) at n as soon as the majority of leaves 
of n exhibit the present (absent) state. Ergo, present and un- 
known are resolved to present, while unknown and absent 
are resolved to absent. The first traversal terminates at the 
root node. All unknown states at the root node are set to 
present (note that this root includes the outgroups). During 
the second traversal (root to leaves), unknown states are
resolved by setting them to the state of their ancestor. 
We used a combination of custom-made python scripts 
and the ETE2 package (Huerta-Cepas et al. 2010) for tree 
traversal. Branch lengths of the tree (Soltis et al. 2002; Choi 
et al. 2004; Magallon and Sanderson 2005; Hedges et al. 
2006; Cartwright and Collins 2007; Anderson and Janßen
2009; Bhattacharya et al. 2009; Bremer et al. 2009; 
Forest and Chase 2009; Herron et al. 2009; Wang and 
Caetano-Anolles 2009; Lang et al. 2010; Reineke et al. 
2011) and whole-genome duplication events (Blanc and 
Wolfe 2004; Schnable et al. 2009; van de Peer et al. 
2009; Jiao et al. 2011) were extracted from the literature.
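
The two-pass reconstruction can be sketched for a single domain as follows. This is an illustrative reading of the procedure rather than the authors' implementation: the `Node` class and function names are hypothetical, ETE2 is deliberately not used so that the example stays self-contained, and the majority is taken over all leaves below a node, as stated in the text.

```python
class Node:
    """Minimal tree node (hypothetical stand-in for an ETE2 tree node)."""

    def __init__(self, name, children=(), present=None):
        self.name = name
        self.children = list(children)
        self.present = present  # observed at leaves only (True or False)
        self.state = None       # "present", "absent" or "unknown"


def first_pass(node):
    """Leaves to root: majority vote over the leaves below each node."""
    if not node.children:
        node.state = "present" if node.present else "absent"
        return (1, 0) if node.present else (0, 1)
    n_present = n_absent = 0
    for child in node.children:
        p, a = first_pass(child)
        n_present += p
        n_absent += a
    if n_present > n_absent:
        node.state = "present"
    elif n_absent > n_present:
        node.state = "absent"
    else:
        node.state = "unknown"  # tie between presence and absence
    return n_present, n_absent


def second_pass(node, parent_state=None):
    """Root to leaves: unknown states inherit the state of their ancestor."""
    if node.state == "unknown":
        # An unknown root is set to present, as described in the text.
        node.state = "present" if parent_state is None else parent_state
    for child in node.children:
        second_pass(child, node.state)


def iter_nodes(node):
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def reconstruct(root):
    """Return a mapping from node name to resolved state for one domain."""
    first_pass(root)
    second_pass(root)
    return {n.name: n.state for n in iter_nodes(root)}
```

Running `reconstruct` once per domain yields the domain content of every internal node, which is the input for the gain and loss comparison along branches.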

We performed a Blast (Altschul et al. 1997) search to 
identify recently duplicated proteins. Proteins with a similarity
of 75% or more and an E value < 10^-10 were considered
to be paralogs. We employed a synteny analysis to distin- 
guish between tandem and segmental duplications. Two 
genes were considered to be tandem duplicates if they were 
five or fewer genes apart. Paralogs with more than five genes
between them were considered to be a result of a segmental 
duplication event (Hanada et al. 2008). 
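
A minimal sketch of the paralog filter and the tandem versus segmental distinction is given below. Several details are assumptions made for illustration: gene positions are represented as (chromosome, index in gene order) pairs, "five or fewer genes apart" is read as at most five intervening genes, paralogs on different chromosomes are treated as segmental, and the E value cutoff is passed in by the caller rather than hard-coded.

```python
def is_paralog_pair(identity_pct, evalue, max_evalue, min_identity=75.0):
    """Apply the similarity and E value criteria to one Blast hit."""
    return identity_pct >= min_identity and evalue < max_evalue


def classify_duplicate(pos_a, pos_b, max_intervening=5):
    """Classify a paralog pair as 'tandem' or 'segmental'.

    pos_a and pos_b are (chromosome, gene_index) tuples, where gene_index
    is the rank of the gene along its chromosome (hypothetical encoding).
    """
    chrom_a, idx_a = pos_a
    chrom_b, idx_b = pos_b
    if chrom_a != chrom_b:
        return "segmental"  # assumption: different chromosomes cannot be tandem
    intervening = abs(idx_a - idx_b) - 1
    return "tandem" if intervening <= max_intervening else "segmental"
```

For example, two paralogs at gene ranks 10 and 16 on the same chromosome have five genes between them and are classified as tandem under this reading, whereas ranks 10 and 17 give a segmental duplicate.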

Domain gain and loss events along branches were calcu- 
lated by comparison of domain content at a given node with 
Fig. 1. Domain gain, loss, emergence and proteome coverage of 26 plant genomes. (A) Correlation of domain gain and loss with branch length.
Both gain and loss correlate significantly with branch length (gain: ρ = 0.6, P < 0.001; loss: ρ = 0.63, P < 0.001). (B) Phylogenetic relationship of all
species used in this study. For each branch, the size of the green circle corresponds to the number of domain emergence events along the branch.
Branches colored in red indicate that the gain and/or loss at this branch is higher than the average gain and/or loss rates. Exact values for domain gain,
loss, and emergence are given in supplementary table 2 (Supplementary Material online). (C) Domain coverage for proteins. The lower axis (percentage
of proteins with domains) displays the proportion of proteins with only Pfam-A domains (red), only Pfam-B domains (dark blue), both Pfam-A and
Pfam-B domains (light blue), and without any protein domain annotation (yellow). The upper axis displays proteome size indicated as vertical black line
for each species. Statistics for three species (Setaria italica, Prunus persica, and Mimulus guttatus) that are still under Fort Lauderdale restriction are not
provided.



the domain content of its ancestor. We distinguish between 
"gained" domains, which are all domains found present at 
a node while absent in its ancestor, and "emerged" do- 
mains, which are gained domains which can only be found 
within Viridiplantae. Ergo, emerged domains are a subset of 
the gained domains. Emerged domains were determined by 
scanning gained domains with HMMER3.0 against NCBI NR 
and Integr8 (Kersey et al. 2005). Gained domains that are
not present in the outgroups were also scanned against 
NCBI NR to determine the kingdoms where these domains 
are present (supplementary table 6, Supplementary Material 
online). Domain event rates (gain and loss) were calculated 
by dividing the number of events predicted to occur along 
a given branch by the branch length (in million years). 
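
The branch-wise comparison and the rate calculation reduce to set differences divided by branch length. The following sketch assumes that the domain content of a node and of its ancestor are available as Python sets of domain identifiers; the function names and the `viridiplantae_only` set (domains with no hits outside green plants) are hypothetical.

```python
def branch_events(node_domains, ancestor_domains, branch_length_myr):
    """Gained and lost domains along one branch, with rates per million years."""
    gained = node_domains - ancestor_domains
    lost = ancestor_domains - node_domains
    return {
        "gained": gained,
        "lost": lost,
        "gain_rate": len(gained) / branch_length_myr,
        "loss_rate": len(lost) / branch_length_myr,
    }


def emerged_domains(gained, viridiplantae_only):
    """Emerged domains are the gained domains found only within Viridiplantae."""
    return gained & viridiplantae_only
```

By construction, the result of `emerged_domains` is always a subset of the gained domains, matching the definition in the text.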

Given the evidence that novel domains are frequently en- 
riched in structural disorder (Buljan et al. 2010; Moore and 
Bornberg-Bauer 2012), we predicted disorder in domains 
classified as emerging. VSL2.0 (Obradovic et al. 2005)
was used to detect structural disorder in domain sequences.
Emerged domains were divided into four bins (Viridiplantae, 
Embryophyta, Tracheophyta, and Magnoliophyta), corre- 
sponding to their emergence nodes. Domains that emerged 
after the Magnoliophyta node were pooled into one "RE- 
CENT" bin. To compare disorder of emerged domains with 
old domains (i.e., domains that exist at the root), a bin 
"OLD" was constructed consisting of 500 randomly picked 
domains occurring in the root. In addition, we constructed 
a "RANDOM" bin consisting of 100 randomly selected do- 
mains, which exist at the root. To account for sampling bias, 
we repeated the random selection 100 times. Statistical infer- 
ence was conducted with the kruskalmc test of the R package 
pgirmess (Siegel and Castellan 1988; R Development Core 
Team 2008). 

We quantified domain emergence and explored a set of 
attributes (Moore and Bornberg-Bauer 2012). Domain 
frequency, d(f), is defined as the absolute frequency of
a domain across all plant genomes used for the analysis. The
domain rate x(d) of domain d is defined as the domain frequency
divided by the number of plants in which d occurs.
The domain success rate corresponds to the domain rate divided
by the node age (in million years) at which the domain
first emerged. The prevalence P(d) of a domain d is the number
of plants with d divided by the number of plants with the
emergence node of d as an ancestor.
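
The four attributes can be computed directly from per-species occurrence counts. The sketch below assumes a dictionary mapping each species to the number of occurrences of one domain, together with the age of the emergence node and the number of species descending from it; all argument and key names are illustrative.

```python
def domain_metrics(counts, emergence_age_myr, n_descendant_species):
    """Frequency, rate, success rate, and prevalence of one emerged domain.

    counts: dict mapping species name to the number of occurrences of the
            domain in that species (zero if absent).
    emergence_age_myr: age of the emergence node in million years.
    n_descendant_species: number of species that have the emergence node
            as an ancestor.
    """
    species_with = [s for s, c in counts.items() if c > 0]
    frequency = sum(counts.values())                       # domain frequency
    rate = frequency / len(species_with)                   # domain rate
    success_rate = rate / emergence_age_myr                # domain success rate
    prevalence = len(species_with) / n_descendant_species  # prevalence
    return {
        "frequency": frequency,
        "rate": rate,
        "success_rate": success_rate,
        "prevalence": prevalence,
    }
```

As a worked example, a domain with four copies in one species and two in another, which emerged 100 Myr ago at a node with four descendant species, has a frequency of 6, a rate of 3, a success rate of 0.03 per Myr, and a prevalence of 0.5.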

Functional Analysis of Domains 

Where available, GO (Ashburner and Lewis 2002) annota- 
tion of proteomes was obtained from PLAZA 2.0 (Proost 
et al. 2009); Blast2GO (Götz et al. 2008) with default settings
was used to functionally annotate the remaining proteomes.
Comparative functional analyses were performed
by assessing GO-term overrepresentation (overrepresenta- 
tion analysis, ORA) in two separate steps. First, for emerging 
domains, we performed the functional analysis indirectly by 
using the GO annotation of arrangements that harbor at 
least one emerging domain, similar to a previous approach 
(Moore and Bornberg-Bauer 2012). Statistical inference was 
conducted using the R package TopGO (Alexa et al. 2006).
As universe, we used the GO annotation of all proteins in 
our data set; the sample consisted of arrangements with 
emerging domains. Second, for assessing functional over- 
representation of arrangements in events (such as fusion 
or fission), we again conducted an ORA using the proteins'
GO annotation; however, our sample here was the arrangement
set that results from a specific event (e.g., all gained
arrangements explainable by a fusion event). P value 
transformed TermClouds were created by logarithmic 
transformation of the False Discovery Rate (FDR)-corrected 
(Benjamini and Hochberg 1995) P value obtained from the 
ORA, such that term size represents the significance of the 
GO term. Visualization was created using Wordle
(http://www.wordle.net/) with the transformed P value as a custom
scaling factor. 
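
The scaling of term size by significance can be sketched as a Benjamini-Hochberg adjustment followed by a logarithmic transformation. The text does not state the base or sign of the logarithm, so the use of the negative decadic logarithm here is an assumption, as are the function names and the small floor that guards against taking the logarithm of zero.

```python
import math


def benjamini_hochberg(pvalues):
    """Return FDR-adjusted P values (Benjamini and Hochberg step-up procedure)."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest to the smallest P value, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def term_weights(term_pvalues, floor=1e-300):
    """Map each GO term to -log10 of its FDR-adjusted P value."""
    terms = list(term_pvalues)
    adjusted = benjamini_hochberg([term_pvalues[t] for t in terms])
    return {t: -math.log10(max(q, floor)) for t, q in zip(terms, adjusted)}
```

The resulting weights grow with significance, so they can serve as per-term scaling factors in a word-cloud tool.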

Reconstruction of the Ancestral Domain Arrangements 
State, Arrangement Gain, and Loss 

We defined domain arrangements as ordered sets of 
domains for each protein. For the analysis of arrangements 
in this study, only Pfam-A domains were used. Ancestral 
states for arrangements were reconstructed as previously 
described. Similarly, arrangement gain and loss was 
determined by comparing current and ancestral states. 

Determination of Arrangement Rates 

For each gained arrangement, we applied a search algo- 
rithm to determine the possible mechanism that led to its 
formation. We considered the four most important mecha- 
nisms of modular rearrangements — fusion, fission, terminal 
deletion, and domain addition (Bjorklund et al. 2005; Pasek
et al. 2006; Weiner et al. 2006; Buljan et al. 2010). The algorithm
assigns a fusion event when two ancestral arrangements
can be fused to form the gained arrangement. A
gained arrangement is considered to be the result of fission 
if an ancestral arrangement can be split to give rise to the 
new arrangement; both products of the split are required to 
be present in the current node. In contrast, for terminal de- 
letion, only one product of the split (the gained arrange- 
ment) may be present in the current node (the other 
product is considered to be lost). The algorithm counts a do- 
main addition event when the newly gained arrangement 
contains a domain that is absent in the ancestral node. 

Note that in general, any new arrangement can be ex- 
plained by a sufficiently large "chain" of events. However, 
since the likelihood of events is not available, we make 
no assumptions about the relative costs of each mechanism 
and therefore are not able to determine the most likely 
chain. Instead, we focus on single-step solutions, that is, 
on cases where a newly gained arrangement can be ex- 
plained by a single event. Using this strategy, we can differ- 
entiate between arrangements with exact solution (i.e., the 
formation can be explained by exactly one mechanism), ar- 
rangements with nonambiguous solution (i.e., only one 
mechanism explains the arrangement but there are several 
events possible) and arrangements with ambiguous solution 
(i.e., conflicting solutions of different types). All arrange- 
ments with solution are referred to as "simple gains," 
whereas all other arrangements are considered to be 
"complex gains." 

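The search for single-step solutions and the resulting categories can be sketched as follows. This is one possible reading of the rules above, not the authors' algorithm: arrangements are represented as tuples of domain identifiers, splits are only considered where the gained arrangement is a prefix or suffix of an ancestral arrangement, and domain addition follows the literal wording (the gained arrangement contains a domain absent from the ancestral node). All names are hypothetical.

```python
def single_step_solutions(gained, ancestral, current):
    """List every single event that could explain one gained arrangement.

    gained: tuple of domain identifiers.
    ancestral, current: sets of arrangements (tuples) at the parent node
    and at the node where the arrangement was gained.
    """
    gained = tuple(gained)
    n = len(gained)
    solutions = []

    # Fusion: two ancestral arrangements concatenate to the gained one.
    for k in range(1, n):
        left, right = gained[:k], gained[k:]
        if left in ancestral and right in ancestral:
            solutions.append(("fusion", (left, right)))

    # Fission or terminal deletion: split a longer ancestral arrangement.
    for anc in ancestral:
        if len(anc) <= n:
            continue
        remainders = set()
        if anc[:n] == gained:
            remainders.add(anc[n:])
        if anc[-n:] == gained:
            remainders.add(anc[:-n])
        for rest in remainders:
            # Both products present: fission; other product lost: deletion.
            kind = "fission" if rest in current else "terminal_deletion"
            solutions.append((kind, (anc, rest)))

    # Domain addition: a domain that is absent from the ancestral node.
    ancestral_domains = {d for arr in ancestral for d in arr}
    new = tuple(d for d in gained if d not in ancestral_domains)
    if new:
        solutions.append(("domain_addition", new))
    return solutions


def classify_gain(gained, ancestral, current):
    """Return 'exact', 'nonambiguous', 'ambiguous', or 'complex'."""
    solutions = single_step_solutions(gained, ancestral, current)
    if not solutions:
        return "complex"
    if len(solutions) == 1:
        return "exact"
    mechanisms = {mechanism for mechanism, _ in solutions}
    return "nonambiguous" if len(mechanisms) == 1 else "ambiguous"
```

Under this sketch, the first three categories together correspond to the "simple gains" defined above, and `"complex"` marks arrangements that no single event explains.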


Results 

Domain Coverage 

In plants, on average, 50% of the proteome residues were 
found to be covered by domain annotation; the residue cov- 
erage ranges from 30% to 70% (supplementary table 1 and 
fig. 2, Supplementary Material online). For an average of 
35% of the residues, for each plant, a Pfam-A domain 
can be detected, whereas Pfam-B domains affect 15% of 
all residues. Residue coverage levels for all species are given 
in supplementary table 1 (Supplementary Material online). 

At the protein level, the coverage distribution is more di- 
verse (supplementary table 1 and fig. 2, Supplementary 
Material online). On average, 70% of the proteins for one 
plant species have at least one Pfam-A or Pfam-B domain. 
Fifty percent of the proteins contain only Pfam-A domains, 
14% contain only Pfam-B domains, and 6% contain both 
Pfam-A and Pfam-B domains (fig. 1C). All protein coverage
values are given in supplementary table 1 (Supplementary 
Material online). The total number of proteins containing 
Pfam-A and Pfam-B domains is highly variable between 
the different proteomes (fig. 1C, supplementary table 1, 
Supplementary Material online). 
Fig. 2. Gene Ontology (GO) terms associated with emerging domains. GO terms affected by emergence were tested for overrepresentation
using the TopGO package and all terms present in plants as universe (for details, see Materials and Methods). The font size corresponds to the value of
significance obtained for this term. Significance was determined after correction for multiple testing using FDR (Benjamini and Hochberg 1995) correction
at P < 0.01. The vast majority of GO terms is related to stimulus response, development, reproduction, regulation, and plant-specific metabolic processes.



Domain Emergence 

To investigate domain gain, loss, and emergence across the 
considered plants, we reconstructed the ancestral domain 
content at each internal node of the tree (see also Materials 
and Methods; supplementary fig. 1, Supplementary Material 
online). In total, 545 domains emerged in the plant kingdom, 
that is, these domains are exclusively found in Viridiplantae. 
The largest amount of domain emergence within plants 
occurs along the branch leading to Embryophyta, which sees 
the emergence of 262 domains (fig. 1B). A total of 114 and
66 domains emerge along the branches to Magnoliophyta 
and Tracheophyta, respectively. Fifty-one domains emerged 
prior to the split of Embryophyta and the green algae and 
52 domains are the result of recent emergence events and 
can only be found within Magnoliophyta (see also Discussion 
below) (fig. 1B).

Radiation and Functional Impact of Emerging Domains 

Next, we assessed whether emerged domains confer spe- 
cific functionalities and whether these might provide adap- 
tive benefit. We assessed functional overrepresentation 
using GO categories and TopGO (Alexa et al. 2006) (see 
Materials and Methods for details). We find that GO terms 
prefixed by response_to are overrepresented along with 
functionalities related to reproduction, developmental 
mechanisms, and metabolic processes (fig. 2). 

We binned emerging domains according to their point of 
emergence (for details, see Materials and Methods) and 
ranked them by their frequency d(f). The 5% highest ranked
domains from each age bin (supplementary table 3, Supple- 
mentary Material online) were subject to further investigation 
as these can be considered to be particularly "successful" 
emerging domains. Among these, we find domains with 
plant-specific functions such as flowering control, auxin reg- 
ulation, fruit development, cell wall development, and plant 
organelle recognition. Furthermore, we detected domains 
related to the F-box protein family, to transcription factors 
and to DNA binding. For the majority of emerging domains, 
direct functional annotation is difficult: the largest proportion
(85%) of all emerging domains in plants are domains of
unknown function (DUFs) or belong to the set of poorly annotated
Pfam-B domains. We assessed functional overrepresentation
using the function of proteins that obtain emerging
domains; we are hence not exploring which functional modules
emerge but rather which protein functionalities undergo
innovation (by the addition of an emerging domain). 

There is increasing evidence that young domains can ex- 
hibit higher levels of structural disorder than established do- 
mains (Buljan et al. 2010; Moore and Bornberg-Bauer 
2012). We examined the degree of structural disorder in 
emerging domains. The results indicate that emerging do- 
mains are significantly enriched in intrinsic disorder, more 
than in randomly chosen domains (see Materials and Meth- 
ods; supplementary fig. 3, Supplementary Material online). 
Furthermore, the younger a domain, the higher the degree 
of disorder. 

Domain Gain and Loss 

Domain gain and loss are frequent events in plant evolution, 
and there is a strong variation between different branches 
(fig. 1A). Nevertheless, both gain and loss rates correlate significantly
with branch length (Spearman rank correlation,
gain: ρ = 0.6, P < 0.001; loss: ρ = 0.63, P < 0.001). On
average, plants have a domain gain rate of 6.64/Myr and
a domain loss rate of 6.11/Myr (fig. 1A, supplementary
table 2 and fig. 9, Supplementary Material online). In
monocots, the average domain gain rate (6.7/Myr) is
lower than the domain loss rate (7.4/Myr), whereas in eudicots
the situation is reversed; eudicots show a loss rate
of 7.4/Myr and a gain rate of 8.3/Myr (supplementary
table 2 and fig. 9, Supplementary Material online). Some
branches exhibit very high loss rates, such as the branch
leading to P. dactylifera, the branches to the two Fabaceae
M. truncatula and L. japonica, and the branches to the
two Andropogoneae Z. mays and S. bicolor (fig. 1B).

Gain, Loss, and Distribution of Arrangements 

We next explored the dynamics of arrangement gain and 
loss. After determining the presence/absence of arrange- 
ments at ancestral nodes (for details, see Materials and 
Fig. 3. Arrangements shared between species. The dashed line represents the number of arrangements shared by the different numbers of
species (right axis). The distribution of unique arrangements is roughly bimodal with the majority of arrangements shared by either few or all species. 
The left axis and barplots display the frequency of arrangements with a certain length (one, two, three, four, five, six, and seven or more domains). 
Although single-domain arrangements tend to occur in all species, longer arrangements are often species-specific. 



Methods), we compared arrangement content at each node 
with the content at the corresponding parent node to de- 
termine arrangement gain and loss. As expected, both gain 
and loss rates correlate significantly with branch length 
(Spearman rank correlation, gain: ρ = 0.56, P < 0.001; loss:
ρ = 0.38, P = 0.003, supplementary fig. 5, Supplementary
Material online). Overall, arrangement gain rate is higher
than arrangement loss rate. However, both rates correlate
significantly with each other (ρ = 0.56, P < 0.001). By
far, the largest amount of arrangement gain (2,814 arrangements)
occurs along the branch to M. domestica, followed
by the branch to R. communis (1,018). Large amounts of
arrangement loss can be found along the branches to P. dactylifera
(1,028) and L. japonica (680); both plants also
showed a high amount of domain loss. All values for ar- 
rangement gain and loss are given in supplementary table 
4 (Supplementary Material online). 
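
The node-versus-parent comparison described above can be illustrated with a minimal Python sketch. It assumes that each node is represented as a set of arrangements and that an arrangement is a tuple of domain names read from N- to C-terminus; the function name and the toy data are hypothetical and are not taken from the original analysis.

```python
def gain_and_loss(parent, child):
    """Return (gained, lost) arrangements of a child node relative to its parent.

    Assumption: both nodes are sets of tuples of domain names.
    """
    gained = child - parent   # present in the child, absent in the parent
    lost = parent - child     # present in the parent, absent in the child
    return gained, lost


# Toy example with hypothetical arrangement content at two nodes.
parent_node = {("Kinase",), ("B_lectin", "PAN_2"), ("Myb",)}
child_node = {("Kinase",), ("B_lectin", "PAN_2", "Kinase"), ("Myb",)}

gained, lost = gain_and_loss(parent_node, child_node)
print("gained:", sorted(gained))
print("lost:", sorted(lost))
```

Dividing such counts by the branch length in Myr would give per-branch rates; how rates were averaged in the study is described in Materials and Methods and is not reproduced here.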

We investigated the number of arrangements shared by
different plant species (fig. 3). The distribution is bimodal,
with the largest number of arrangements being either
specific to one species (~7,000) or shared by all (~1,000); only
a very small number of arrangements is shared by 10-20
species. Although by far the largest proportion of arrange- 
ments shared by all species consists of single-domain proteins, 
the contrary is true for species-specific arrangements. Here, 
the largest number of arrangements tends to be composed 
of more than one domain, with a large proportion contain- 
ing seven or more domains. This indicates that the longer an 
arrangement is, the more likely it is species-specific. 
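
As an illustration of how such a sharing distribution can be tabulated, the following Python sketch counts, for every arrangement, the number of species in which it occurs. The input layout (a dictionary from species name to arrangements) and the toy data are assumptions made for this example only.

```python
from collections import Counter


def sharing_distribution(proteomes):
    """Map k -> number of distinct arrangements found in exactly k species.

    Assumption: `proteomes` maps a species name to an iterable of
    arrangements, each arrangement being a tuple of domain names.
    """
    species_per_arrangement = Counter()
    for arrangements in proteomes.values():
        # set() ensures an arrangement is counted once per species.
        for arrangement in set(arrangements):
            species_per_arrangement[arrangement] += 1
    return Counter(species_per_arrangement.values())


# Toy data: three hypothetical species.
toy = {
    "sp1": [("A",), ("A", "B")],
    "sp2": [("A",), ("C", "D")],
    "sp3": [("A",), ("A", "B"), ("E", "F", "G")],
}
print(sharing_distribution(toy))
```

In this toy input, the single-domain arrangement is shared by all three species, whereas the longest arrangement is species-specific, which mirrors the qualitative pattern described above.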



Modular Rearrangements 

Using a simple model of modular rearrangement (for details, 
see Materials and Methods), we next explored the mecha- 
nisms that can facilitate the formation of novel arrange- 
ments. For this, we considered fusion, fission, terminal 
deletion, and domain addition. The results illustrate that 
70% of all gained arrangements can be explained by exactly 
one solution (exact solutions). Of the gained arrangements, 
14% can be explained by one particular mechanism, how- 
ever, with a number of different possible solutions (nonam- 
biguous solutions); only 4% have conflicting solutions 
(ambiguous solutions). The remaining 12% of all new ar- 
rangements are complex gains that likely arose by a chain 
of events (see Materials and Methods; fig. 4). The different 
events were found to occur with different frequencies 
(table 1). Fusion events make up the largest proportion of
exact solutions, followed by domain addition, fission, and
terminal deletion. Fusion events occur with a frequency
of 4.59/Myr, followed by fission with 1.98/Myr, and domain
addition with 1.89/Myr. Domain deletion events can be split into
C-terminal and N-terminal domain deletion; both events 
have a frequency of 0.7/Myr. All rates were averaged across 
all branches. We further explored event frequencies across 
different age bins. At the Embryophyta node, 68% of 
new arrangements are affected by domain addition and 
26% by fusion. Domain deletion (4%) and fission (3%) 
are less prevalent at this node. Over time, the frequency 
of domain deletion and fission increases up to 13% and 
21% in recent rearrangements, whereas domain additions
decrease to a frequency of 24%. The largest fraction of
recently gained arrangements (49%) can be explained by
fusion events (fig. 4).

322 Genome Biol. Evol. 4(3):316-329. doi:10.1093/gbe/evs004 Advance Access publication January 16, 2012

Table 1
Contribution of Fusion, Fission, C-Terminal Deletion, N-Terminal
Deletion, and Domain Addition to Simple Arrangement Gains

                     Fusion  Fission  C-Del  N-Del  Add
Total number         9,669   4,073    1,283  1,424  4,848
Average number/Myr   4.59    1.98     0.7    0.7    1.89

Note. Del, deletion; Add, addition.
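
The classification of gained arrangements into exact, nonambiguous, ambiguous, and complex cases can be sketched in Python as follows. This is a simplified toy model written for illustration, not the search algorithm from Materials and Methods: the event definitions (in particular, treating fission as a split whose remaining part is also present in the child) and all function names are assumptions.

```python
def one_step_solutions(new, parent_arrs, child_arrs):
    """List every one-step event that could produce `new` from the parent node.

    Assumption: arrangements are tuples of domain names; the event
    definitions below are a simplified toy model.
    """
    n = len(new)
    solutions = []
    # Fusion: `new` is the concatenation of two parental arrangements.
    for i in range(1, n):
        if new[:i] in parent_arrs and new[i:] in parent_arrs:
            solutions.append(("fusion", new[:i], new[i:]))
    for p in sorted(parent_arrs):
        # Domain addition: one domain attached to either terminus of p.
        if len(p) + 1 == n and p in (new[:-1], new[1:]):
            solutions.append(("addition", p))
        # Terminal deletion: one domain removed from either terminus of p.
        if len(p) == n + 1 and new in (p[:-1], p[1:]):
            solutions.append(("terminal deletion", p))
        # Fission: p splits into `new` plus a remainder the child also has.
        if len(p) > n and n > 0:
            if p[:n] == new and p[n:] in child_arrs:
                solutions.append(("fission", p))
            elif p[-n:] == new and p[:-n] in child_arrs:
                solutions.append(("fission", p))
    return solutions


def classify(solutions):
    """Label a gained arrangement by the number and kind of its solutions."""
    if not solutions:
        return "complex"        # no one-step explanation
    if len(solutions) == 1:
        return "exact"          # exactly one solution
    if len({s[0] for s in solutions}) == 1:
        return "nonambiguous"   # one mechanism, several solutions
    return "ambiguous"          # conflicting mechanisms


parent = {("A", "B"), ("C", "D")}
child = {("A", "B", "C", "D")}
print(classify(one_step_solutions(("A", "B", "C", "D"), parent, child)))
```

Under these toy definitions, a single gained arrangement may be explained by several events at once, which is why the number of solutions and the number of distinct mechanisms are tracked separately.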

Discussion 

Domain Emergence 

The increasing availability of plant genomes has allowed us 
to conduct a comparative domain analysis between a set of 
diverse plant species. Here, we reconstruct the ancestral 
states of domain content and arrangement and investigate 
the functional impact of domain emergence and domain re- 
arrangements across a comprehensive set of 29 genomes 
dating back ~800 Myr. However, the considered clade still
contains a number of species for which genome sequences
are missing, such as the gymnosperms or the Charophyta. As
these genomes become available, a more comprehensive 
picture of modular evolution in plants will emerge. 

In contrast to animals, plants are sessile organisms that 
are unable to escape strong environmental shifts and must 
rather adapt to such variation. Hence, plants, more so than 
animals, are required to evolve mechanisms in order to deal 
with biotic and abiotic stresses. Here, we illustrate that the
emergence of new domains can provide an important
strategy for evolving stress response. More than 500 domains
emerged within Viridiplantae of which more than 100 do- 
mains are unique for Tracheophyta (fig. 1). We recently as- 
sessed the impact of domain emergence in a set of 
insects, where only 30 domains emerged within 19 insect
genomes spanning roughly 300 Myr of evolution (Moore and
Bornberg-Bauer 2011). Hence, it would seem that plants
exhibit a large amount of domain innovation. One might
speculate that plants at least partly address the challenge of
a sessile lifestyle by means of domain innovation. The inves- 
tigation of GO terms of proteins containing emerged domains 
further supports this notion. A large number of terms are re- 
lated to plant-specific processes, such as megagametogenesis 
and development of plant-specific organs. This is not surpris- 
ing as the reproductive system and morphology of plants not 
only differ strongly from other kingdoms but are also highly 
variable between plant species (Endress 2001; Bennici 2005; 
Williams 2008; Kawakita and Kato 2009). Besides these 
plant-specific functions, a number of overrepresented GO 
terms correspond to response_to categories and to secondary 
metabolite pathways related to stress response, such as auxin 
and jasmonic acid. Such secondary metabolites are strongly 
related to the defense and response mechanisms in plants 
(Grace and Logan 2000; Pateraki and Kanellis 2010; Kerchev 
et al. 2012). As the composition of these compounds is 
variable between plant species and also within species 
(Kroymann 2011), such secondary metabolites may provide 
a strong flexible basis for improving adaptation and defense. 

Functional links to photosynthesis are not found amongst
emerged domains (fig. 2). This is likely explained by
photosynthesis not being unique to plants; photosynthetic
processes can be found in algae and in many species of
bacteria (Olson 1970, 2001). Indeed, photosynthesis-related
GO terms can be detected by investigating gained domains
which are absent in the outgroups (supplementary fig. 6,
Supplementary Material online), as well as response_to
terms and a number of plant-specific functionalities related
to development, similar to those terms found in proteins
containing emerged domains.

Fig. 4. Mechanisms of rearrangement across different clades. We applied a search algorithm to assess the mechanisms that might account for
newly gained arrangements (for details, see Materials and Methods). Only 12% of all gained arrangements cannot be explained by a one-step event
(complex gains). The remaining 88% of simple gains can be further differentiated into exact solutions where only one particular mechanism (fusion,
fission, terminal deletion, or domain addition) was necessary to explain the arrangement gain event (70%). All proteomes were divided into four
different age bins: Embryophyta, Tracheophyta, Magnoliophyta, and Recent Nodes. The frequencies of fusion, fission, and terminal deletion increase
over time, whereas the frequency of domain addition decreases.

Emerged domains seem to be evolutionarily important as 
they have a high prevalence of 0.9-1, indicating that their 
occurrence is strongly conserved. Besides their widespread 
occurrence in nearly all leaves, such emerged domains often 
occur in high copy numbers (supplementary table 3, 
Supplementary Material online). 

Investigating the most successful emerged domains un- 
covers connections to key functional categories such as tran- 
scription factors, binding-related processes, and secondary 
metabolites, including auxin and jasmonic acid (supplemen- 
tary table 3, Supplementary Material online). Indeed, a burst 
of transcription factors and their constituent domains, 
which are assumed to be correlated with increasing
complexity in plant evolution (Lang et al. 2010), has been found
in angiosperms. The increase of plant complexity with du- 
plication events (Freeling et al. 2006) may partly be the result 
of duplication facilitating increasingly complex regulatory 
networks (Veron et al. 2007). 

Emerging domains exhibit an increased amount of intrin- 
sic disorder; the more recent the emergence event, the more 
likely the domain in question exhibits intrinsic disorder. Dis- 
ordered sequences may increase the binding affinity of pro- 
teins (Dyson and Wright 2005). High intrinsic disorder paired 
with the fact that emerged domains are significantly under- 
represented in single-domain proteins (hypergeometric test, 
P < 0.01), leads us to the speculation that emerging do- 
mains may have higher interaction potential, which in turn 
may increase their viability and result in higher prevalence 
and frequency. Indeed, some of the most successful emerg- 
ing domains have links to binding-related processes. 
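
For readers unfamiliar with the hypergeometric test for underrepresentation, the lower-tail probability can be computed as in the following Python sketch. The parameterization and the toy numbers are assumptions for illustration; they are not the counts used in this study.

```python
from math import comb


def hypergeom_lower_tail(k, M, n, N):
    """P(X <= k) for a hypergeometric variable X.

    Assumed parameterization (hypothetical labels):
      M = all domains considered
      n = emerged domains among them
      N = domains that occur in single-domain proteins
      k = emerged domains observed among those N
    A small value indicates underrepresentation.
    """
    total = comb(M, N)
    upper = min(k, n, N)
    # Sum the probabilities of drawing 0..upper emerged domains.
    return sum(comb(n, i) * comb(M - n, N - i) for i in range(upper + 1)) / total


# Toy numbers only: 10 domains, 5 emerged, 5 drawn, none of them emerged.
print(hypergeom_lower_tail(0, 10, 5, 5))
```

The same quantity is available in statistical libraries; the explicit sum is shown here only to make the definition of the test transparent.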

Arrangement Mechanisms 

In plants, roughly 70% of the domain-containing proteins 
are single domain (supplementary fig. 4, Supplementary 
Material online). This high percentage of single-domain
proteins can be an artifact of low domain coverage or "eroded
domains," which have diverged beyond detection (Weiner
et al. 2006). Recent rearrangements can mostly be explained
by the fusion of two single- or two-domain proteins. The
relative rates of fusion and fission are similar to previously 
reported rates (Kummerfeld and Teichmann 2005). GO 
terms overrepresented in proteins, which arose via fusion, 
are stress-, defense-, and adaptation-related as well as 
related to the reproduction system (supplementary fig. 7,
Supplementary Material online). In contrast, proteins formed
by fission mainly play a role in metabolic and biosynthesis pro- 
cesses (supplementary fig. 7, Supplementary Material online). 
Proteins shaped by domain deletion are mainly related to ba- 
sic functions such as the primary metabolism, and only a mi- 
nor part of these proteins are stress-response related 
(supplementary fig. 7, Supplementary Material online). 

Our results provide further evidence that duplication im- 
pacts rates of modular rearrangement (Buljan and Bateman 
2009). We find that proteins affected by rearrangement 
events are overrepresented in duplicated genes (supplemen- 
tary table 6, Supplementary Material online). Furthermore, 
we find indications that species with recent WGD have
higher rates of fusion and fission in comparison to species
without recent WGD (supplementary table 7, Supplementary
Material online). In general, duplicates are thought to un- 
dergo one of three different scenarios: subfunctionalization, 
where the two duplicates share ancestral gene function; 
neofunctionalization, where one copy retains the ancestral 
function and the other copy diverges toward a novel func- 
tion; and pseudogenization, where one copy is not ex- 
pressed and is subsequently lost (Walsh 2003). One 
explanation for sub- or neofunctionalization is the loss or 
change of regulatory regions (Ganko et al. 2007). As the 
conservation of noncoding sequences follows an
exponential decay rate (Reineke et al. 2011), the retention of both
duplicates might be the result of a change in one of the
gene's regulatory regions under relaxed selective
constraints. The high retention rate of proteins that result from
a fusion event might be explained by the conservation of at 
least one regulatory element in the upstream region, 
whereas after fission, one arising protein may lose a regula- 
tory region and undergo pseudogenization followed by 
gene loss. A further reason for sub- and neofunctionaliza- 
tion after duplication might be domain rearrangements in 
one paralog or differential domain loss (Buljan et al. 2010). 

We further illustrate the impact of protein domain rear- 
rangements on an organism's protein repertoire (fig. 5). The 
emerging domains PAN_2 (emerged in the Tracheophyta) 
and S_locus_glycop (Embryophyta) often co-occur
with the B-lectin domain. Arrangements containing the two 
emerged domains S_locus_glycop and PAN_2 are frequently 
rearranged within paralogous genes (fig. 5) and obtain a cat- 
alytic function through the addition of kinase domains. Pro- 
teins that consist of arrangements with these two emerged 
domains have GO functions related to the recognition of 
pollen, protein phosphorylation, and cell recognition. Al- 
though we observed fusion events in tandemly duplicated 
genes in our case study, fusion events are not generally over- 
represented in tandemly duplicated genes (supplementary 
table 5, Supplementary Material online). After fusion, dupli- 
cates might be difficult to recognize as paralogs. One might 
therefore speculate that in tandemly duplicated proteins 
fused arrangements are harder to detect. The increased
rates of events along more recent branches might be
explained by WGDs, which have taken place in angiosperms
(De Bodt et al. 2005; Freeling et al. 2006; Shoemaker
et al. 2006; van de Peer et al. 2009; Paterson et al.
2010). Indeed, in a pairwise comparison of fusion and fission
rates between plant pairs that differ by one recent WGD,
we find increased rates in plants with more recent WGD
(supplementary table 7, Supplementary Material online).
Roughly one-third of all vascular plants have undergone
recent WGDs (Wood et al. 2009).

Fig. 5. Example of two emergent domains at the Tracheophyta node (PAN_2) and Embryophyta node (S_locus_glycop). The evolution of
example arrangements over time is shown in five different species (Arabidopsis thaliana [AT], Oryza sativa [OS], Populus trichocarpa [PT], Ricinus
communis [RC], Vitis vinifera [VV]). The observable diversity in arrangements within this family is explainable by simple one-step events of fusion, fission,
terminal deletion, and domain gain.

Arrangement Distribution 

We investigated the distribution of shared arrangements 
among the plant species. The majority of domain arrange- 
ments are either species-specific or universal (fig. 3). This 
bimodal distribution is even stronger when we consider only 
a well-annotated subset of our species and exclude the 
green algae (supplementary fig. 8, Supplementary Material 
online). In particular, proteins with two or three domains are 
often species-specific. In combination with the observation 
that roughly 70% of all domain-containing proteins are 
single-domain proteins (supplementary fig. 4, Supplemen- 
tary Material online), this can lead to the assumption that 
the fusion of single-domain proteins is a powerful mecha- 
nism to obtain species-specific proteins with new function- 
alities. This distribution suggests that only very few long 
arrangements are highly conserved; long arrangements 
are possibly more often affected by fission events. Proteins 
with arrangements shared by several but not all species are 
overrepresented in GO terms related to basic functions such 
as primary metabolism, cellulose biosynthetic process, and cell 
wall organization. In proteins with arrangements shared by
a subset of between 5 and 24 proteomes, innate_immune_response
is significantly overrepresented, suggesting that
there might be different pathogens affecting different sub- 
clades. Proteins with GO terms related to reproduction, signal 
transduction, and prefixed with response_to are overrepre- 
sented in species-specific arrangements or those shared by 
only few species. The high number of species-specific ar- 
rangements observed here is in accordance with the observa- 
tion that, within a set of five angiosperm species, around 
20% of proteins do not align to an orthologous group (Pa- 
terson et al. 2010). The high amount of species-specific ar- 
rangements and genes might also be a consequence of 
frequent duplication events followed by lineage-specific re- 
tention (Paterson et al. 2010). This supports the hypothesis 
that plants have many flexible genetic mechanisms to pro- 
duce species-specific adaptation (Bomblies 2010). 

Gain and Loss of Domains and Arrangements 

We investigated gain and loss at the levels of domains and 
domain arrangements by reconstructing the ancestral states 
based on maximum parsimony. We observe that gain and 
loss can frequently be found in all clades in plant evolution 
at both domain and arrangement levels. This is in agreement 
with Buljan and Bateman (2009), who found an equal event 
distribution after speciation and duplication within animals 
and a high amount of change in arrangements after dupli- 
cation events. As we here do not conduct a direct compar- 
ison of paralogs, but instead compare presence/absence 
patterns of domains and their arrangements across pro- 
teomes, our results only support the notion that domain 
gain and loss can be found along all branches and that both 
have a significant correlation with each other and with 
branch length (fig. 1A). Branches with an increased loss rate
have, on average, a higher domain gain rate. This high gain 
and loss rate, branch-specific variation and the large number 
of species-specific arrangements show the high variability 
and flexibility with which single-step mechanisms can create 
evolutionary novelties. One might speculate that the high 
gain rate of arrangements in M. domestica (supplementary 
table 4, Supplementary Material online) is caused by the
recent polyploidy event or hybridization as a consequence of
domestication (Velasco et al. 2010). The large amount of
domain loss in P. dactylifera might also be the consequence
of low sequence coverage (Al-Dous et al. 2011). Differences
in gain and loss between the different branches might also 
be a consequence of variation in generation time between 
plants. Evidence from studies in Fugu and Tetraodon sug- 
gests that intron loss is increased in species with shorter gen- 
eration time (Loh et al. 2008). Similar patterns have been 
found in Arabidopsis and rice (Roy and Penny 2007). 
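
To make the parsimony-based reconstruction of ancestral presence/absence states more concrete, the following Python sketch implements the bottom-up pass of Fitch parsimony for a single domain or arrangement. The text above only states that maximum parsimony was used; the choice of the Fitch algorithm, the nested-tuple tree encoding, and the leaf names are assumptions made for this illustration.

```python
def fitch(tree, leaf_states):
    """Bottom-up Fitch pass on a binary tree.

    Assumption: `tree` is either a leaf name (str) or a pair of subtrees;
    `leaf_states` maps leaf names to 1 (present) or 0 (absent).
    Returns (candidate states at this node, minimum number of changes below it).
    """
    if isinstance(tree, str):
        return {leaf_states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch(left, leaf_states)
    right_set, right_changes = fitch(right, leaf_states)
    shared = left_set & right_set
    if shared:
        # The children agree on at least one state: no extra change needed.
        return shared, left_changes + right_changes
    # The children disagree: keep both states and count one change.
    return left_set | right_set, left_changes + right_changes + 1


# Hypothetical four-leaf tree; the feature is present in sp1 and sp2 only.
toy_tree = (("sp1", "sp2"), ("sp3", "sp4"))
states = {"sp1": 1, "sp2": 1, "sp3": 0, "sp4": 0}
print(fitch(toy_tree, states))
```

Comparing the states inferred at a node with those at its parent then yields the gain and loss events counted along each branch.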

Coverage 

The average domain residue coverage is 50%. Protein cov- 
erage varies strongly even between closely related species 
(fig. 1C). Three plants belonging to the Fabaceae clade
are included in this study: G. max, L. japonica, and
M. truncatula. Their branches split around 50-60 Ma (Reineke et al.
2011). Several events of WGD have been found within the
Fabaceae clade; all three species share a common WGD fol- 
lowed by additional independent WGDs (Blanc and Wolfe 
2004). These WGDs in connection with different retention 
and pseudogenization rates might explain the variance in 
coverage within this clade. It is also possible that a number 
of plant-specific domains are not yet described, as the
number of sequenced plant genomes is still considerably
lower than the number of currently available animal genomes. In the
Fabaceae family, for example, a unique conserved disor- 
dered region has been described in sieve element occlusion 
genes (Ruping et al. 2010; Ernst et al. 2011). Many of these
family-specific conserved functional sequences might still
not be covered by Pfam. It should also be considered that
genome quality varies between the investigated genomes,
which might be the cause of differences in domain
coverage; the most recently sequenced genomes exhibit low
coverage in comparison to longer established genomes such as
M. truncatula or O. sativa (fig. 1C).

Conclusions 

The results presented here provide, from a phylogenomic 
perspective, multiple insights into the evolutionary dynamics 
of modular rearrangements and the potential adaptive 
benefits in plant genomes. Although around 70% of all pro- 
teins are single-domain proteins and a large fraction of these 
are shared by many species, we observe a very high
volatility of novel domains and arrangements in general. Most
strikingly, the majority of all arrangements is species-specific
or restricted to a very small clade. Our phylogeny-based
approach reveals that the majority of novel arrangements can
be explained by single-step events such as fusion, fission,
and terminal loss. Several events of accelerated activity of
rearrangements and domain emergence could be
associated with the respective changes in stress adaptation and
morphogenesis. This is particularly pronounced for fusion in
regulatory proteins. We thus observe a dominant effect 
of rearrangements on adaptation, which is partly driven 
by the high volatility of novel domains. 

Taken together, this study illustrates another layer of 
complexity, which explains how modularity helps plants 
to both create and exploit their abundant genetic material 
in order to accomplish rapid adaptation in response to en- 
vironmental challenges. We propose these results will fuel 
further large-scale experiments. Recent experiments in fungi 
using recombination of libraries of domains from signaling 
proteins (Peisajovich et al. 2010) and the expansion of do- 
main repeats in self-recognition molecules (Chevanne et al. 
2010) have already highlighted the enormous evolutionary 
potential of modularity in protein evolution. Along these 
lines, experiments on plant adaptation should be more 
explicitly geared at furthering our understanding of how 
protein modularity facilitates rapid adaptation. 

Supplementary Material 

Supplementary figures 1-9 and tables 1-7 are available at 
Genome Biology and Evolution online (http://www.gbe.oxfordjournals.org/).